1 Tasks

*City names in Acfrs database are not obvious, i.e., does not contain a certain word to indicate it’s a city

*The census population data use GEOIDs, which are “numeric codes that uniquely identify all administrative/legal and statistical geographic areas for which the Census Bureau tabulates data”.

*ACFRs data has government_id (which is called census_id in the ACFRs portal), but not geo_id.

*Use a “middle file” to link geo_id and government_ID

*Need to a) identify city and towns in census population data and b) link those to corresponding acfrs entities.

2 Census Polulation - Cities and Towns

List of cities and towns with geo_id.

#Marc emailed this data Feb2 2022: using 017 Census master file data to map to the 2020 Census Population data using FIPS Codes. GEO_ID: Census identifiers that point to population of places in US
geoID_pop20_census <- rio::import(here::here("data", "City and Town Mapping.xlsx"), sheet = 2) %>% 
rename(population = `!!Total:`, 
       geo_id = GEO_ID) 

# join with df_state to get state.abb and state.name --> to have a common col with acfrs to join later
df_state <- data.frame(state.abb, state.name) %>% 
  add_row(state.abb = "DC", state.name = "District of Columbia")


pop <- geoID_pop20_census %>% 
  separate(NAME, c("city_town", "state.name"), sep = ",")  %>%  
  mutate(state.name = str_trim(state.name), 
         city_town = str_trim(city_town)) %>% 
  left_join(df_state) %>% drop_na()

datatable(pop)

There are 31615 observations in this data set.

2.1 Linking government_ID and geo_id

ACFRs database uses government ID (enumeration ID of local governments), but do not have geo_id.

We use a Census master file data to map ACFRs to the 2020 Census Population data by FIPS Codes.

#government_ID (used in ACFRs): enumeration ID of local governments. ACFRs data does not use FIP codes 
# Note that population data is ACFRs portal is 2017 -> don't use
# Note: census_id, which means government_id in Acfrs are numeric --> can be turn to scientific notation in R --> will not match --> need to avoid scientific notation. 

# This has both government_ID and geo_id
governmentID <- rio::import(here::here("data", "City and Town Mapping.xlsx"), sheet = 3) %>% 
  select(government_ID, `INFERRED GEO_ID`, NAME, CITY, STATE_AB, COUNTY_AREA_NAME)  %>% 
  rename(geo_id = `INFERRED GEO_ID`,  # Marc created INFERRED GEO_ID, which meant to be geo_id
         original_name = NAME)   
  
# Use governmentID to join with population
pop_governmentID <- pop %>% 
  left_join(governmentID) %>% drop_na(government_ID)

datatable(pop_governmentID %>% select(geo_id, government_ID, city_town, population))

3 City and Towns

*Get all general purpose entities in acfrs

*Join above acfrs entities with dataset of city and town population data that has governmentID

# census_id in ACFRs is government_ID used in file "City and Town Mapping.xlsx"

# 1. get all general purpose entities in acfrs
acfrs_governmentID <- readRDS("data/data_from_dbsite_2020.RDS") %>% 
  filter(category == "General Purpose") %>% 
  rename(government_ID = census_id) %>% 

# problem: government_ID in pop_governmentID has 14 characters (05201902700000) >< some government_ID in ACFRs has 13 characters 5201902700000 --> Marc: Yes, it is safe to assume that these government_ID in ACFRs data should have 0 at the beginning.
  mutate(government_ID = ifelse(str_length(government_ID) < 14, paste0("0", government_ID), government_ID)) 


# 2. join with city and town population data that has governmentID
acfrs_city_pop <- acfrs_governmentID %>% 
  left_join(pop_governmentID) %>% 
  select(-c(state, STATE_AB, year)) %>% 
  arrange(desc(population)) %>% 
  drop_na(population)

datatable(acfrs_city_pop)

At this point Acfrs now has 11682 general purpose entities FY 2020, of which 6810 are cities and towns.

3.1 Checking for overlapping

Check if there’s any overlapping between the above lists

# all general purpose entities: 11187
length(unique(acfrs_governmentID$government_ID)) 
## [1] 11187
unique(acfrs_governmentID$government_ID) -> t1

# cities and towns: 6810
length(unique(acfrs_city_pop$government_ID)) 
## [1] 6810
unique(acfrs_city_pop$government_ID) ->t2

# gap bw the 2 above: 4377
length(unique(setdiff(t1, t2)))
## [1] 4377
# list of counties (in count_gov.Rmd): 2569
county_pop_census_acfrs <- readRDS("county_pop_census_acfrs.RDS")
length(unique(county_pop_census_acfrs$census_id))
## [1] 2530
unique(county_pop_census_acfrs$census_id) ->t3

# entities overlapping  between city list and county list

intersect(t2, t3) -> intersect_city_county

# view the overlapping cities 
acfrs_city_pop %>% filter(government_ID %in% intersect_city_county) %>% 
 select(name, government_ID)
##                                                   name  government_ID
## 1                                    Macon-Bibb County 11201100100000
## 2                                        Marion County 15202700500000
## 3                                      Kankakee County 14204601000000
## 4                              Adrian (Lenawee County) 23204600200000
## 5                           Anaconda-Deer Lodge County 27201200100000
## 6                                    Shenandoah County 47207000200000
## 7                              Bronson (Branch County) 23201200100000
## 8                                       Amherst County 47200500100000
## 9                            Bangor (Van Buren County) 23208000100000
## 10                                       Louisa County 47205500100000
## 11                                   Appomattox County 47200600100000
## 12                           Bessemer (Gogebic County) 23202700100000
## 13                              Armada (Macomb County) 23205000100000
## 14                          Birch Run (Saginaw County) 23207300100000
## 15 North Sherman County Rural Fire Protection District 38202900100000
## 16                                     Buchanan County 47201200100000
## 17                          Beaverton (Gladwin County) 23202600100000
## 18                                      Halifax County 47204200200000
## 19                             Baroda (Berrien County) 23201100100000
## 20                            Benzonia (Benzie County) 23201000100000
## 21                                        Floyd County 47203200100000
## 22                          Casnovia (Muskegon County) 23206100100000
## 23                                        Surry County 47209100300000
## 24                                      Madison County 47205700100000
## 25                                   Washington County 47207900100000

Those whose names contain parentheses are actually cities. The part in side the parentheses indicates which county they belong to.

3.2 Final City and County List

state_gov <-readRDS("state_gov.RDS")

# composition of city, county, state gov in general purpose
 # acfrs_governmentID %>% 
 #  left_join(pop_governmentID) %>% #drop_na(population) %>% 
 #  select(-c(state, STATE_AB, year)) %>% 
 #  arrange(desc(population)) %>% 

# take out county 2569
# filter(!government_ID %in% county_pop_census_acfrs$census_id) %>% 
#    filter(!id %in% state_gov$id)

Lastly, take 15 who are actually counties out of the cities list.

# actually county
  acfrs_city_pop %>% filter(government_ID %in% intersect_city_county) %>% 
  filter(!grepl("\\)", name)) -> actual_county

acfrs_city_pop_final <- acfrs_city_pop %>% 
  filter(!government_ID %in% actual_county$government_ID) 

datatable(acfrs_city_pop_final)
 saveRDS(acfrs_city_pop_final, "acfrs_city_pop.RDS")

The list of city and county contains 6795 entities.